[Metax][Optimization] Optimize PaddleOCR-VL vision path on Metax GPU by Dryoung95 · Pull Request #7619 · PaddlePaddle/FastDeploy

Dryoung95 · 2026-04-24T16:52:06Z

Motivation

This PR optimizes the PaddleOCR-VL vision path on Metax GPU.

During profiling, extra overhead was observed around extract_vision_features_paddleocr(), especially in vision metadata preparation, position embedding preparation, and projector-side data organization. This PR reduces unnecessary host/device synchronization and repeated small tensor operations while keeping the existing vision computation semantics unchanged.

Modifications

This PR updates the following files:

fastdeploy/model_executor/models/paddleocr_vl/projector.py
fastdeploy/model_executor/models/paddleocr_vl/siglip.py
fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py
fastdeploy/worker/metax_model_runner.py
tests/model_executor/test_paddleocr_vl_vision_path.py

Main changes:

Optimize PaddleOCR-VL projector-side packing flow and support returning packed image features directly.
Reuse host-side grid_thw_lst metadata in extract_vision_features_paddleocr() to avoid unnecessary tensor-to-CPU synchronization on the FD_ENABLE_MAX_PREFILL=1 path.
Optimize packed position embedding preparation in Siglip vision embeddings.
Support the batch=1 fast path in Siglip attention and encoder layer while sharing the encoder-layer computation logic.
Make rotary embedding precision explicit by requiring float32 input in apply_rotary_pos_emb_vision().
Add unit tests for the changed PaddleOCR-VL vision-path logic.
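
To illustrate the host-side metadata reuse listed above, here is a minimal sketch of deriving cumulative sequence lengths from a plain Python list instead of reading a device tensor back to the CPU. The helper name `build_cu_seqlens` is hypothetical and not taken from this PR, and the layout of `grid_thw_lst` as `(t, h, w)` tuples, with each temporal slice contributing `h * w` patches, is an assumption.

```python
from itertools import accumulate


def build_cu_seqlens(grid_thw_lst):
    """Return cumulative patch counts computed purely on the host.

    Hypothetical helper: assumes each entry is a (t, h, w) tuple and that
    every temporal slice contributes h * w patches.
    """
    seq_lens = []
    for t, h, w in grid_thw_lst:
        # One sequence of h * w patches per temporal slice.
        seq_lens.extend([h * w] * t)
    # Prefix sums with a leading 0, so no tensor-to-CPU copy is required.
    return [0, *accumulate(seq_lens)]


grid_thw_lst = [(1, 4, 6), (1, 2, 2)]
print(build_cu_seqlens(grid_thw_lst))  # [0, 24, 28]
```

Because the input is already a Python list, the offsets can be built without any synchronization point on the device stream.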

Usage or Command

Unit test added for this PR:

  python -m pytest -q tests/model_executor/test_paddleocr_vl_vision_path.py

Local validation commands:

  python -m ruff check \
    fastdeploy/model_executor/models/paddleocr_vl/projector.py \
    fastdeploy/model_executor/models/paddleocr_vl/siglip.py \
    fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py \
    fastdeploy/worker/metax_model_runner.py \
    tests/model_executor/test_paddleocr_vl_vision_path.py

  python -m black --check \
    fastdeploy/model_executor/models/paddleocr_vl/siglip.py \
    fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py \
    fastdeploy/worker/metax_model_runner.py \
    tests/model_executor/test_paddleocr_vl_vision_path.py

Performance validation on Metax GPU:

Single request: 7.4039s -> 7.3870s, about -0.23%
4x8 small concurrency average: about -6.07%
4x8 small concurrency P50: about -5.97%
4x8 small concurrency P95: about -9.73%
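
The single-request percentage can be rechecked from the two latencies above. This is only an arithmetic sketch; `relative_change_pct` is a hypothetical helper name, and the concurrency figures cannot be rechecked this way because their raw latencies are not listed.

```python
def relative_change_pct(before, after):
    """Signed relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


# Single-request latency reported above: 7.4039s -> 7.3870s.
print(f"{relative_change_pct(7.4039, 7.3870):.2f}%")  # -0.23%
```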

Accuracy Tests

This PR keeps the PaddleOCR-VL vision math semantics unchanged and only reduces unnecessary data organization and metadata movement overhead.

Validation performed:

Service starts successfully.
/health returns 200.
/v1/models returns normally.
Single OCR request returns normally.
4x8 small concurrency requests complete stably.
Added unit tests covering projector packing, Siglip batch=1 fast path, position embedding cache/cast path, and native Neox rope embedding 2D/3D paths.

Local test result:

  python -m pytest -q tests/model_executor/test_paddleocr_vl_vision_path.py
  # 7 passed

Checklist

Add at least a tag in the PR title.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-24T16:52:12Z

Thanks for your contribution!

CLAassistant · 2026-04-24T16:52:13Z

All committers have signed the CLA.

Dryoung95 · 2026-04-24T17:00:20Z

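So far there is no code-level evidence of a build failure, test failure, or smoke-test failure; this looks more like a problem in the runner / Jenkins remoting layer.

Could someone please rerun this check?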

codecov-commenter · 2026-04-27T04:04:43Z

Codecov Report

❌ Patch coverage is 93.67089% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@73f11e0). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...y/model_executor/models/paddleocr_vl/siglip_ops.py	80.00%	2 Missing and 2 partials ⚠️
...eploy/model_executor/models/paddleocr_vl/siglip.py	96.42%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7619   +/-   ##
==========================================
  Coverage           ?   71.87%           
==========================================
  Files              ?      396           
  Lines              ?    55493           
  Branches           ?     8689           
==========================================
  Hits               ?    39884           
  Misses             ?    12854           
  Partials           ?     2755

Flag	Coverage Δ
GPU	`71.87% <93.67%> (?)`
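
The headline percentages in this report can be reproduced from its own counts. The sketch below is only an arithmetic check; `coverage_pct` is a hypothetical helper, and the figure of 79 changed lines is inferred from the reported patch percentage rather than stated in the report.

```python
def coverage_pct(hits, total):
    """Covered lines as a percentage of total lines."""
    return hits / total * 100.0


# Project totals from the diff table: 39884 hits out of 55493 lines.
print(f"{coverage_pct(39884, 55493):.2f}%")  # 71.87%

# Patch coverage: 5 lines are not fully covered (missing plus partials).
# The total of 79 changed lines is an inference, not a reported number.
print(f"{coverage_pct(79 - 5, 79):.5f}%")  # 93.67089%
```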


Dryoung95 · 2026-04-28T05:16:57Z

@luotao1 Could you please trigger CI?

PaddlePaddle-bot · 2026-04-28T08:39:36Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-29 13:33:06

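This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 9ede0f3
Merge base: 73f11e0 (branch: develop)
View full diff
CI details

1 Task overview

⏳ CI is still running: 4 required tasks have not finished and no task has failed so far. Please wait for completion before checking the final result.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
28(0)	28	21	0	5	2	0

2 Task status summary

2.1 Required tasks: 4/8 passed

Required tasks block merging. 4 are still running; please wait for them to finish.

Status	Task	Duration	Logs	Rerun
⏳	`Extracted partial CE model tasks to run in CI. / run_ce_cases`	-	CI details	-
⏳	`Run Base Tests / base_tests`	-	CI details	-
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	CI details	-
⏳	`Run Four Cards Tests / run_4_cards_tests`	-	CI details	-
✅	The other 4 required tasks passed	-	-	-

2.2 Optional tasks: 17/20 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Logs	Rerun
⏳	`Trigger Jenkins for PR`	-	CI details	-
⏸️	`CI_HPU`	-	-	-
⏸️	`Run iluvatar Tests / run_iluvatar_cases`	-	-	-
✅	The other 17 optional tasks passed	-	-	-

3 Failure details (required only)

No required tasks have failed.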

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-04-29 13:28:10

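📋 Review summary

PR overview: Optimizes the PaddleOCR-VL vision path for Metax GPU by reducing host/device synchronization and repeated small-tensor operations, improving concurrent inference performance.
Scope of changes: model_executor/models/paddleocr_vl/ (projector, siglip, siglip_ops), worker/metax_model_runner.py, unit tests
Impact tags: [Models] [Metax] [Optimization]

📝 PR convention check

The title contains [Metax] and [Optimization], both official tags, and its format is compliant. The description includes all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) with substantive content, and the Checklist state is consistent with the diff. ✅ The PR conforms to the conventions.

Issues

Level	File	Summary
🟡 Suggestion	`siglip.py:131`	`assert` is used for runtime batch=1 validation; it is disabled under Python `-O`
🟡 Suggestion	`metax_model_runner.py:507`	`assert` is used for runtime grid_thw consistency validation; it is disabled under Python `-O`
❓ Question	`siglip_ops.py:40`	`apply_rotary_pos_emb_vision` now requires float32 input; please confirm that `neox_rope_embedding` has been updated accordingly

Overall assessment

This PR systematically reduces vision-path overhead through batched projection, fused metadata construction, reuse of an LFU position-embedding cache, and mm_hash deduplication. The approach is clear and the unit-test coverage is complete. The two asserts used as defensive runtime checks should be replaced with raise ValueError so that they do not silently become no-ops in Python optimized mode. Regarding the float32 constraint on apply_rotary_pos_emb_vision, the author is asked to clarify whether the neox_rope_embedding side has been updated accordingly.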

PaddlePaddle-bot · 2026-04-29T05:35:31Z

    ):
-        B, seq_length, D = hidden_states.shape
+        if hidden_states.dim() == 3:
+            assert hidden_states.shape[0] == 1, f"SiglipAttention only supports batch=1, got {hidden_states.shape}"


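🟡 Suggestion: assert is used for runtime shape validation

Under Python -O, assert statements are skipped entirely, so this check silently stops working.

Consider raising an explicit exception instead:

    if hidden_states.shape[0] != 1:
        raise ValueError(
            f"SiglipAttention only supports batch=1, got {hidden_states.shape}"
        )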
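
The difference between the two forms can be reproduced without Paddle. The sketch below uses the built-in `compile()` with `optimize=1`, which strips assert statements the same way `python -O` does. `require_batch_one` is a hypothetical stand-in for the check inside SiglipAttention, operating on a plain shape tuple.

```python
def require_batch_one(shape):
    """Raise ValueError unless the leading (batch) dimension is 1."""
    if shape[0] != 1:
        raise ValueError(f"SiglipAttention only supports batch=1, got {shape}")


bad_shape = (2, 16, 64)

# optimize=1 mirrors `python -O`: the assert is compiled away, so executing
# this code with a bad shape raises nothing.
stripped = compile("assert shape[0] == 1, 'bad batch'", "<demo>", "exec", optimize=1)
exec(stripped, {"shape": bad_shape})

# The explicit check is ordinary control flow and survives optimization.
try:
    require_batch_one(bad_shape)
except ValueError as exc:
    print(exc)
```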

PaddlePaddle-bot · 2026-04-29T05:35:31Z

 def apply_rotary_pos_emb_vision(x, cos, sin):
-    orig_dtype = x.dtype
-    x = x.astype("float32")
+    assert x.dtype == paddle.float32, f"expected float32, got {x.dtype}"


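❓ Question: the apply_rotary_pos_emb_vision interface change requires callers to guarantee float32 input

native_neox_rope_embedding already performs the cast correctly before the call. However, SiglipAttention.forward actually calls neox_rope_embedding with the signature (qkv, cos, sin, num_heads, head_dim) (presumably a custom op), which does not appear in this PR's diff. If that function calls apply_rotary_pos_emb_vision internally without guaranteeing float32 input, the assert will fail at runtime.

Please confirm that neox_rope_embedding has been updated to guarantee float32 input, or explain that it does not go through this function.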
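
For readers unfamiliar with the operation being discussed, here is a pure-Python sketch of the rotate-half (Neox-style) rotary formulation. It is an assumption about the general technique, not code from this PR: the real function works on float32 tensors, and the names `rotate_half` and `apply_rotary` are illustrative.

```python
import math


def rotate_half(x):
    """Split x into halves (x1, x2) and return (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rotary(x, cos, sin):
    """Element-wise x * cos + rotate_half(x) * sin."""
    rot = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rot, cos, sin)]


# Rotating the 2-D vector (1, 0) by 90 degrees should give roughly (0, 1).
theta = math.pi / 2
out = apply_rotary([1.0, 0.0], [math.cos(theta)] * 2, [math.sin(theta)] * 2)
print([round(v, 6) for v in out])
```

The repeated multiply-and-add on cos/sin values is why the precision of the input dtype matters for this function.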

PaddlePaddle-bot · 2026-04-29T05:35:31Z

+                                else:
+                                    grid_thw_tensor = paddle.to_tensor(grid_thw_key, dtype=paddle.int64)
+                                    multi_vision_inputs["images_lst"].append(
+                                        paddle.to_tensor(


🟡 Suggestion: assert is used for runtime data-consistency validation

This check is disabled under Python -O. Consider changing it to:

    if pending_mm_grid_thw[mm_hash] != grid_thw_key:
        raise ValueError(
            f"mm_hash {mm_hash} grid_thw mismatch: "
            f"{pending_mm_grid_thw[mm_hash]} != {grid_thw_key}"
        )
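
A self-contained sketch of how such a check might sit inside an mm_hash deduplication step follows. Only the names `pending_mm_grid_thw`, `mm_hash`, and `grid_thw_key` come from the review; the function `register_image` and its return convention are hypothetical and not taken from metax_model_runner.py.

```python
def register_image(pending_mm_grid_thw, mm_hash, grid_thw):
    """Record grid_thw for mm_hash; return True only for a new hash.

    Hypothetical helper: a repeated hash must carry the same grid_thw,
    otherwise a ValueError is raised instead of relying on assert.
    """
    grid_thw_key = tuple(grid_thw)
    if mm_hash not in pending_mm_grid_thw:
        pending_mm_grid_thw[mm_hash] = grid_thw_key
        return True
    if pending_mm_grid_thw[mm_hash] != grid_thw_key:
        raise ValueError(
            f"mm_hash {mm_hash} grid_thw mismatch: "
            f"{pending_mm_grid_thw[mm_hash]} != {grid_thw_key}"
        )
    return False  # duplicate image, already scheduled for encoding


pending = {}
print(register_image(pending, "abc", [1, 4, 6]))  # True
print(register_image(pending, "abc", [1, 4, 6]))  # False
```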

Dryoung95 had a problem deploying to Metax_ci April 24, 2026 16:52 — with GitHub Actions Error

perf(metax): streamline PaddleOCR-VL vision path

162c6f5

Dryoung95 force-pushed the codex/opt5-metax-paddleocr-vl branch from c6b3374 to 162c6f5 Compare April 24, 2026 16:54

Dryoung95 had a problem deploying to Metax_ci April 24, 2026 16:54 — with GitHub Actions Failure

luotao1 assigned HackathonBot and luotao1 and unassigned HackathonBot Apr 27, 2026

luotao1 added contributor External developers PaddlePaddle Hackathon labels Apr 27, 2026

style: format PaddleOCR-VL Metax changes

f0bce32

Dryoung95 temporarily deployed to Metax_ci April 27, 2026 10:04 — with GitHub Actions Inactive

Dryoung95 changed the title ~~[Metax][Optimization] 优化 PaddleOCR-VL 在 Metax GPU 上的视觉路径开销~~ [Metax][Optimization] Optimize PaddleOCR-VL vision path on Metax GPU Apr 28, 2026

test: cover PaddleOCR-VL vision path changes

0efdcbc

Dryoung95 had a problem deploying to Metax_ci April 28, 2026 12:12 — with GitHub Actions Failure

ci: retrigger workflows after first merged PR

b283a93

Dryoung95 had a problem deploying to Metax_ci April 29, 2026 05:10 — with GitHub Actions Error

style: sort PaddleOCR-VL test imports

9ede0f3

Dryoung95 temporarily deployed to Metax_ci April 29, 2026 05:19 — with GitHub Actions Inactive

PaddlePaddle-bot reviewed Apr 29, 2026
